Computer Science 1

Computer lab 07

Course Number: BTPS1210BA-E
Contact Information: Please contact me through email anytime:
Author

Kálmán Abari

Published

August 31, 2024

Learning outcomes
Students will learn how to transform data.

Jamovi - Data transformation

Problem 1

In this section, you will learn how to perform unit conversions.

In dataset_xlsx.omv (see Lab05 Problems), the variables Height and Weight are measured in inches and pounds. Convert them to centimeters and kilograms.

Important systems of measurement

The International System of Units (SI), often referred to as the metric system, is the most widely used system of measurement in the world today. It provides a standardized way of quantifying physical quantities such as length, mass, time, and temperature, making it easier for scientists, engineers, and people across different countries to communicate and work with consistent measurements. The SI system is based on a set of fundamental units and prefixes that allow for easy scaling of units across orders of magnitude.

Some important features of the SI system include seven base units, from which all other units are derived:

  • Meter (m): The unit of length.
  • Kilogram (kg): The unit of mass.
  • Second (s): The unit of time.
  • Ampere (A): The unit of electric current.
  • Kelvin (K): The unit of temperature.
  • Mole (mol): The unit of amount of substance.
  • Candela (cd): The unit of luminous intensity.

The SI system also incorporates a range of prefixes, such as kilo-, centi-, and milli-, to modify these base units and make them suitable for various applications.

Apart from the SI system, there are other important systems of measurement used in specific contexts:

  • Imperial System: A traditional system that employs units like inches, feet, pounds, and gallons for length, weight, and volume measurements. The name is often used loosely for the everyday units of the United States as well.

  • U.S. Customary System: The system actually used in the United States. It is similar to the Imperial system, but some units differ slightly (for example, the U.S. gallon is smaller than the imperial gallon).

  • British Imperial System: Used in the UK and some other Commonwealth countries, it has units like the inch, foot, pound, and gallon.

Each system has its own historical and cultural significance and is applied in specific fields or regions, but the SI system has gained global acceptance as the standard system of measurement, facilitating international scientific and technological collaboration.

  1. Open dataset_xlsx.omv (see Lab05 Problems) in jamovi.

  2. How do we convert inches to centimeters? The variable Height is measured in inches. Activate the variable to be converted, which is now Height.

Variable to be converted
  3. Select the Data / Compute menu.

Insert a new variable
  4. Rename the variable to Height_cm, and add the description Heights in centimeters

Add name and description
  5. Since 1 inch is equal to 2.54 centimeters (\(1\,\text{in} = 2.54\,\text{cm}\)), type the following in the expression field: ROUND(Height*2.54)

Calculation
  6. How do we convert pounds to kilograms? The variable Weight is measured in pounds. Activate the variable to be converted, which is now Weight.

Variable to be converted
  7. Select the Data / Compute menu.

Insert a new variable
  8. Rename the variable to Weight_kg, and add the description Weights in kilograms

Add name and description
  9. Since 1 kg is equal to about 2.2046 pounds (\(1\,\text{kg} = 2.20462262185\,\text{lb}\)), type the following in the expression field: ROUND(Weight/2.2046)

Calculation
  10. Send the file dataset_xlsx.omv to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 1.
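The two conversions can also be checked outside jamovi. The following Python sketch mirrors the expressions ROUND(Height*2.54) and ROUND(Weight/2.2046); the function names and sample values are made up for illustration, and Python's round() may treat exact .5 cases differently from jamovi's ROUND().

```python
# Conversion factors taken from the steps above.
CM_PER_INCH = 2.54   # 1 inch = 2.54 cm
LB_PER_KG = 2.2046   # 1 kg is about 2.2046 lb


def inches_to_cm(height_in: float) -> int:
    """Convert inches to centimeters, rounded to a whole number."""
    return round(height_in * CM_PER_INCH)


def pounds_to_kg(weight_lb: float) -> int:
    """Convert pounds to kilograms, rounded to a whole number."""
    return round(weight_lb / LB_PER_KG)


# Hypothetical respondent: 70 inches tall, 150 pounds.
print(inches_to_cm(70))   # 70 * 2.54 = 177.8, rounds to 178
print(pounds_to_kg(150))  # 150 / 2.2046 is about 68.04, rounds to 68
```

Comparing a few rows of Height_cm and Weight_kg against such a hand calculation is a quick way to confirm that the expression was typed correctly.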

Problem 2

In this section, you will learn how to insert a computed variable.

A new variable may be based on more than one variable. In this problem, we wish to compute BMI for the respondents in our sample. The height (in centimeters) and weight (in kilograms) of the respondents were observed; so to compute BMI, we want to plug those values into the formula

\[BMI=\frac{\text{Weight in kg}}{(\text{Height in m})^2}\]

  1. Use dataset_xlsx.omv from Problem 1.

  2. Append a new computed variable to the dataset. Select Data / Add / Computed variable / Append menu.

Append computed variable

Appended variable
  3. Rename the variable to BMI, and add the description Body Mass Index.

Add name and description
  4. Type the following in the expression field: ROUND(Weight_kg/(Height_cm/100)^2, 2)

Calculation
  5. Send the file dataset_xlsx.omv to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 2.
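As a cross-check of the BMI expression, here is a minimal Python sketch of the same calculation. The function name and the sample respondent are hypothetical; the division by 100 converts centimeters to meters, as in the jamovi expression.

```python
def bmi(weight_kg: float, height_cm: float) -> float:
    """Body Mass Index: weight in kg divided by squared height in meters."""
    height_m = height_cm / 100  # centimeters to meters
    return round(weight_kg / height_m ** 2, 2)


# Hypothetical respondent: 68 kg, 178 cm.
print(bmi(68, 178))  # 68 / 1.78^2 is about 21.46
```

Forgetting the conversion to meters is the most common mistake here; it makes every BMI value 10,000 times too small.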

Problem 3

In this section, you will learn how to calculate the Z-score of a variable.

In dataset_xlsx.omv (see Lab05 Problems), calculate the Z-scores of the variables English, Reading, Math, and Writing. Name them ZEnglish, ZReading, ZMath, and ZWriting.

Z-score

Calculating the Z-score of a variable is a statistical technique used to standardize and normalize data. It’s primarily done for two main reasons:

  1. Comparing Different Distributions
    Z-scores allow you to compare data points from different datasets or variables with different units of measurement. By converting the data into a common scale, you can make meaningful comparisons. This is especially useful when working with data from various sources or when you want to assess how a data point compares to the rest of the data in a distribution.

  2. Identifying Outliers
    Z-scores help in identifying outliers within a dataset. An outlier is an observation that deviates significantly from the rest of the data. Z-scores make it easier to set a threshold (e.g., ±2 or ±3 standard deviations) to flag values as outliers. This is because, in a standardized distribution, most data points should fall within a certain range around the mean, and those outside this range can be considered outliers.

The formula to calculate the Z-score for an individual data point is:

\[Z=\frac{X-\mu}{\sigma}\]

Where:

  • \(Z\) is the Z-score.
  • \(X\) is the individual data point.
  • \(\mu\) is the mean (average) of the dataset.
  • \(\sigma\) is the standard deviation of the dataset.

Here’s how Z-scores are typically used:

  • If a Z-score is close to 0, it means the data point is close to the mean.
  • A positive Z-score indicates that the data point is above the mean.
  • A negative Z-score indicates that the data point is below the mean.

By standardizing the data with Z-scores, you can make informed decisions about the relative position of data points and identify unusual or extreme values. This can be valuable in various fields, including finance, quality control, and scientific research, where understanding data distribution and detecting outliers are important for analysis and decision-making.

In jamovi, you can compute standardized scores for numeric variables automatically using the Z() function. One important distinction is that the “raw” scores are centered on their sample mean and scaled by (divided by) their sample standard deviation; that is:

\[z=\frac{x-\bar{x}}{s}\]
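The sample formula above can be sketched in Python as follows. This is an illustration, not jamovi's implementation: it assumes that \(s\) is the sample standard deviation with the \(n-1\) denominator, and the function name and scores are made up.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values using the sample mean and sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in values]


# Hypothetical test scores: mean 80, sample standard deviation 10.
print(z_scores([70, 80, 90]))  # [-1.0, 0.0, 1.0]
```

A standardized variable always has mean 0 and standard deviation 1, which is a useful sanity check for the new Z variables.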

  1. Open dataset_xlsx.omv (see Lab05 Problems) in jamovi.

  2. In this section, we only present the Z-transformation of the variable English into a new variable ZEnglish. We leave the transformation of the other variables to the Reader. Activate the variable to be Z-scored, which is now English.

Variable to be Z-scored
  3. Select the Data / Compute menu.

Insert a new variable
  4. Rename the variable to ZEnglish, and add the description Z-score of variable 'English'

Add name and description
  5. Type the following in the expression field: Z(English)

Calculation of Z-score
  6. Repeat the above steps for the Reading, Math, and Writing variables.

Calculation of Z-scores
  7. Send the file dataset_xlsx.omv to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 3.

Problem 4

In this section, you will learn how to merge categories.

Class ranks in high schools and colleges are nicknames for the year of study a person is completing: “freshman” (first-year), “sophomore” (second-year), “junior” (third-year), “senior” (fourth-year). Class ranks are also sometimes divided into “underclassmen” (first or second-year students) and “upperclassmen” (third or fourth-year students).

In the dataset_xlsx.omv (see Lab05 Problems), the variable Rank has the categories Freshman (1), Sophomore (2), Junior (3), and Senior (4). Let’s merge the categories and create a new indicator variable called RankIndicator with the levels Underclassman (1) and Upperclassman (2).

  1. Open dataset_xlsx.omv (see Lab05 Problems) in jamovi.

  2. Activate the variable to be transformed, which is now Rank.

Variable to be transformed
  3. Select the Data / Transform menu.

Insert a new variable
  4. Rename the variable to RankIndicator, and add the description Merging the categories of 'Rank'. Two levels: "underclassmen" (first or second-year students) and "upperclassmen" (third or fourth-year students)

Add name and description
  5. Make sure ‘Source Variable’ is set to Rank and select ‘Create New Transform’ from ‘Use Transform’.

Create new transform

New transform
  6. Rename the transform to Merging Rank categories, and add the description Two levels: "underclassmen" and "upperclassmen".

Add name and description
  7. Click ‘Add recode condition’ twice and type in the conditions shown in the figures to reach this state. Then select ‘Ordinal’ from ‘Measure type’.

Clicking on ‘Add recode condition’

Recode condition (1-2)

Recode condition (else)
  8. Close this Transform.

Closing transform
  9. Close this Transformation.

Closing transformation
  10. Send the file dataset_xlsx.omv to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 4.
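The logic of this recode (ranks 1 and 2 become Underclassman, everything else becomes Upperclassman) can be sketched in Python. The function and dictionary names are hypothetical, and the sketch assumes Rank is stored as the numeric codes 1 to 4 with no missing values.

```python
# Labels for the two merged levels.
LABELS = {1: "Underclassman", 2: "Upperclassman"}


def rank_indicator(rank: int) -> int:
    """Merge Freshman/Sophomore (1-2) into 1 and Junior/Senior (3-4) into 2."""
    return 1 if rank <= 2 else 2


# One row for each original category.
for rank in [1, 2, 3, 4]:
    print(rank, LABELS[rank_indicator(rank)])
```

The "else" branch plays the same role as the final recode condition in the jamovi transform: it catches every value not matched by an earlier condition.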

Problem 5

In this section, you will learn how to discretize a continuous variable.

One important use of transformation is dichotomizing or discretizing a continuous variable. Dichotomizing a continuous variable transforms an interval/ratio variable into a binary categorical variable by splitting the values into two groups based on a cut point. Discretizing a continuous variable transforms an interval/ratio variable into an ordinal categorical variable by splitting the values into three or more groups based on several cut points.

In the sample dataset, the variable CommuteTime represents the amount of time (in minutes) it takes the respondent to commute to campus. Let’s try recoding this variable into three ordinal groups:

  • 1 = Commute is 30 minutes or less (time ≤ 30)
  • 2 = Commute is more than 30 minutes, but less than 60 minutes (30 < time < 60)
  • 3 = Commute is an hour or more (time ≥ 60)

Give the new variable the name CommuteLength.

  1. Open dataset_xlsx.omv (see Lab05 Problems) in jamovi.

  2. Activate the variable to be transformed, which is now CommuteTime.

Variable to be transformed
  3. Select the Data / Transform menu.

Insert a new variable
  4. Rename the variable to CommuteLength, and add the description Discretizing a variable 'CommuteTime'

Add name and description
  5. Make sure ‘Source Variable’ is set to CommuteTime and select ‘Create New Transform’ from ‘Use Transform’.

Create new transform

New transform
  6. Rename the transform to Discretizing a CommuteTime, and add the description Three levels.

Add name and description
  7. Click ‘Add recode condition’ twice and type in the conditions shown in the figures to reach this state. Then select ‘Ordinal’ from ‘Measure type’.

Clicking on ‘Add recode condition’

Recode conditions
  8. Send the file dataset_xlsx.omv to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 5.
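The three-group recode can be written as a chain of conditions that are checked in order. The Python sketch below follows the verbal definitions of the groups, so it assumes that exactly 30 minutes belongs to group 1 and exactly 60 minutes to group 3; the function name is hypothetical.

```python
def commute_length(minutes: float) -> int:
    """Discretize commute time (in minutes) into three ordinal groups."""
    if minutes <= 30:
        return 1  # 30 minutes or less
    if minutes < 60:
        return 2  # more than 30 but less than 60 minutes
    return 3      # an hour or more


# Boundary values are the cases worth checking by hand.
for t in [10, 30, 31, 59, 60, 95]:
    print(t, commute_length(t))
```

Because the conditions are evaluated from top to bottom, the second test does not need to repeat "greater than 30": any value reaching it has already failed the first test.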

Problem 6

In this section, you will learn how to filter cases.

jamovi comes with the ability to filter out rows that you don’t want included in your analyses. There are a number of reasons why this might be appropriate. For example, you might want to include people’s survey responses only if they explicitly consented to having their data used, or you might want to exclude all left-handed people, or perhaps people who score ‘below chance’ in an experimental task. In some cases you just want to exclude extreme scores, for example those that are more than 3 standard deviations from the mean.

In our example, let’s select those male athletes whose height is greater than 190 cm.

  1. Open dataset_xlsx.omv (see Lab05 Problems) in jamovi.

  2. Select the Data / Filter menu. Set up the filters as shown in the pictures. As the status bar shows, the number of filtered rows is 417.

Filter panel

Filter conditions
  3. Make the filters inactive without deleting them. As the status bar shows, the number of filtered rows is now 0.

Inactive filters
  4. Send the file dataset_xlsx.omv to abari.kurzus@gmail.com. The subject of this email is Lab07 - Problem 6.
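A filter is simply a logical condition evaluated for every row: rows where it is true are kept, and the rest are filtered out. The Python sketch below illustrates the idea for "male athletes taller than 190 cm". The column names Gender and Athlete, their codings, and the four sample rows are assumptions made for this illustration; check the actual names and levels in the dataset.

```python
# Hypothetical rows; the real dataset may use other column names and codings.
rows = [
    {"Gender": "Male", "Athlete": "Yes", "Height_cm": 195},
    {"Gender": "Male", "Athlete": "No", "Height_cm": 198},
    {"Gender": "Female", "Athlete": "Yes", "Height_cm": 192},
    {"Gender": "Male", "Athlete": "Yes", "Height_cm": 185},
]


def keep(row) -> bool:
    """True for male athletes taller than 190 cm (all three conditions must hold)."""
    return (
        row["Gender"] == "Male"
        and row["Athlete"] == "Yes"
        and row["Height_cm"] > 190
    )


selected = [row for row in rows if keep(row)]
print(len(selected), "rows kept,", len(rows) - len(selected), "rows filtered out")
```

Only the first sample row satisfies all three conditions, so one row is kept and three are filtered out.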

Problem 7

In this section, you will learn how to evaluate a psychology test.

The dataset_xlsx.omv (see Lab05 Problems) has four test scores:

  • English - Score on English placement test (out of 100 points); numeric, scale
  • Reading - Score on Reading placement test (out of 100 points); numeric, scale
  • Math - Score on Math placement test (out of 100 points); numeric, scale
  • Writing - Score on Writing placement test (out of 100 points)
  1. Insert two new variables, AverageScore and SumScore, that will be calculated as the mean and sum of the four test scores. Use the MEAN() and SUM() functions in jamovi.

  2. Send your dataset_xlsx.omv to abari.kurzus@gmail.com. The subject of this email is Lab 07 - Problem 7.
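For one respondent, the two new variables are just the sum and the mean of the four scores. A minimal Python sketch with made-up scores (the function name is hypothetical, and missing scores are not handled):

```python
from statistics import mean


def score_summary(scores):
    """Return the sum and the mean of one respondent's test scores."""
    return {"SumScore": sum(scores), "AverageScore": mean(scores)}


# Hypothetical respondent: English 80, Reading 70, Math 90, Writing 60.
result = score_summary([80, 70, 90, 60])
print(result["SumScore"], result["AverageScore"])  # sum 300, mean 75
```

With four complete scores, the mean is always the sum divided by 4, so the two variables rank respondents identically; they differ only in scale.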

Scoring methods

Whether to use the sum or average (mean) to evaluate a psychology test depends on the nature of the test, the specific research objectives, and the constructs being measured. Both methods have their advantages and can provide valuable insights, but they serve different purposes:

  1. Sum (Total) Score:

    • Advantages: The total score provides a straightforward way to aggregate individual item or question responses. It represents the overall magnitude or intensity of the construct being measured. This is useful when you want to assess the cumulative effect of responses on a single construct.
    • Use Cases: Sum scores are commonly used in surveys or tests where responses are additive, such as in personality assessments, where multiple items contribute to a single trait score.
  2. Average (Mean) Score:

    • Advantages: The mean score represents the central tendency of responses and is particularly useful when you want to gauge the typical or average level of a construct. Because it stays on the scale of the individual items, it is easier to interpret and to compare across tests with different numbers of items.
    • Use Cases: Mean scores are often used when responses to individual items represent separate dimensions or facets of a construct, such as in multi-dimensional personality tests or questionnaires measuring diverse aspects of a behavior or trait.

In many cases, psychology tests and surveys use a combination of scoring methods. For instance, a test may provide both total and subscale scores to capture both the overall construct and its various dimensions. Researchers should carefully consider the characteristics of the test and the research context to determine which scoring method or combination of methods is most appropriate for their specific needs.

Problem 8

In this section, you will learn how to identify outliers.

The dataset_xlsx.omv has a variable Height_cm (see Problem 1). Insert 4 dichotomous variables (with values 0 and 1, where 1 means that the given observation in Height_cm is an outlier):

  • Height_cm_oZ2 - observations that are more than 2 standard deviations away from the mean would be marked with a 1.
  • Height_cm_oZ3 - observations that are more than 3 standard deviations away from the mean would be marked with a 1.
  • Height_cm_IQR1.5 - observations that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR would be marked with a 1.
  • Height_cm_IQR2.5 - observations that fall below Q1 - 2.5 * IQR or above Q3 + 2.5 * IQR would be marked with a 1.
  1. Insert the four new variables described above.

  2. Using a frequency table, show the distribution of the above four variables, that is, we need to know how many outliers there are in each case.

  3. Create an html file outliers.html containing the images of the 4 frequency tables above, plus a short explanation for each of the 4 cases.

  4. Send your outliers.html to abari.kurzus@gmail.com. The subject of this email is Lab 07 - Problem 8.

Outliers

In statistics, an outlier is an observation or data point that significantly deviates from the rest of the data in a dataset. Outliers can be unusually high or low values and are sometimes referred to as “extreme values” or “anomalies.” These data points are often considered to be atypical or exceptional compared to the majority of the data.

Outliers can occur for various reasons, including measurement errors, data entry errors, natural variability, or genuine anomalies in the data. Identifying and handling outliers is an essential step in data analysis because they can have a substantial impact on statistical summaries and analysis results.

Here are a few methods for identifying outliers:

  1. Visual Inspection: Plotting the data using graphs, such as histograms, box plots, or scatter plots, can help identify potential outliers. Points that are visibly distant from the bulk of the data may be outliers.

  2. Z-Scores: Calculate the Z-score for each data point. Z-scores measure how many standard deviations a data point is from the mean. Data points whose absolute Z-score exceeds a specified threshold (e.g., 2 or 3) are considered outliers.

  3. Interquartile Range (IQR): The IQR is the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile) of the data. Data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are often considered outliers.

  4. Expert Knowledge: In some cases, domain knowledge or subject-matter expertise is used to identify outliers, especially when unusual values have a logical explanation in the context of the data.
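The Z-score and IQR rules can be sketched in Python as 0/1 flags, which is the form Problem 8 asks for. This is only an illustration under stated assumptions: the heights are made up, the function names are hypothetical, the standard deviation is the sample one, and quartiles are computed with Python's default method, which may give slightly different Q1 and Q3 values than jamovi or other software.

```python
from statistics import mean, quantiles, stdev


def z_outlier_flags(values, threshold):
    """Flag with 1 the values lying more than `threshold` sample SDs from the mean."""
    m, s = mean(values), stdev(values)
    return [1 if abs((x - m) / s) > threshold else 0 for x in values]


def iqr_outlier_flags(values, k):
    """Flag with 1 the values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles; the definition varies by software
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [1 if x < low or x > high else 0 for x in values]


# Hypothetical heights in cm; 230 is an implausibly large value.
heights = [160, 165, 168, 170, 171, 172, 174, 176, 180, 230]

print("oZ2:   ", sum(z_outlier_flags(heights, 2)))
print("oZ3:   ", sum(z_outlier_flags(heights, 3)))
print("IQR1.5:", sum(iqr_outlier_flags(heights, 1.5)))
print("IQR2.5:", sum(iqr_outlier_flags(heights, 2.5)))
```

In this small sample the extreme value inflates the standard deviation so much that it passes the 2 SD rule but not the 3 SD rule, whereas both IQR rules flag it. This illustrates why quartile-based rules are often preferred when outliers are suspected.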